This version of practical manual includes output of the scripts and answers.
In this part of the course we will look into basic steps of the pathway enrichment analysis. There are many different methods, implementations and tools for performing this.
Some up-to-date web-based enrichment tools are, for example:
Here we will focus on ORA (overrepresentation analysis) and GSEA (Gene Set Enrichment Analysis), as implemented in ClusterProfiler R package (Yu et al. 2012).
# These packages are from CRAN
# install.packages('knitr') # HTML reports
# install.packages('kableExtra') # Simple HTML tables
# install.packages('data.table') # Fast data reader
# install.packages('ggplot2') # Plotting framework
# install.packages('dplyr') # Powerful package for data management
# These packages are from Bioconductor
# source("https://bioconductor.org/biocLite.R") # Indicate the package repository
# biocLite("clusterProfiler") # Package for enrichment tests
# biocLite("org.Hs.eg.db") # Human gene annotation package
# biocLite("pathview") # For visualization of enriched KEGG pathways
# biocLite("topGO") # Package for GO enrichment analysis
# biocLite("DOSE") # Package for disease enrichment analysislibrary(knitr)
library(kableExtra)
library(data.table)
library(clusterProfiler)
library(pathview)
library(ggplot2)# Create the new directory
#dir.create("~/PathwayAnalysisPracticalSession/")
# Change the working directoy
#setwd("~/PathwayAnalysisPracticalSesion/")Some of the commands we use in this practical use random processes, such as random permutations. This means that each time the script is ran, the outcome may differ very slightly, just by chance. To make your code reproducible it is good idea to define the number “seed” in the beginning of the script. This way the script gives exactly the same output every time it is ran.
set.seed(123)We will read in the results of differential expression analysis from previous practical. We will concentrate on anti-CD3-stimulated dataset in this practical.
cd3 <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_cd3_all.csv')Next we will do some data cleaning and convert gene IDs into usable format.
# Explore the data
summary(cd3)## V1 baseMean log2FoldChange lfcSE
## Length:23228 Min. : 0.0 Min. :-2.920 Min. :0.000
## Class :character 1st Qu.: 0.4 1st Qu.:-0.213 1st Qu.:0.178
## Mode :character Median : 40.2 Median :-0.005 Median :0.272
## Mean : 1516.7 Mean : 0.034 Mean :0.276
## 3rd Qu.: 982.1 3rd Qu.: 0.258 3rd Qu.:0.366
## Max. :1388488.9 Max. : 3.569 Max. :0.476
## NA's :3287 NA's :3287
## stat pvalue padj
## Min. :-8.123 Min. :0.000 Min. :0.000
## 1st Qu.:-0.923 1st Qu.:0.054 1st Qu.:0.124
## Median :-0.017 Median :0.321 Median :0.440
## Mean : 0.064 Mean :0.397 Mean :0.451
## 3rd Qu.: 1.059 3rd Qu.:0.727 3rd Qu.:0.757
## Max. : 8.822 Max. :1.000 Max. :1.000
## NA's :3287 NA's :3287 NA's :6744
# There are some log2FCs, lfcSE marked as "NAs". What is the reason?
head(cd3[is.na(cd3$log2FoldChange), ])## V1 baseMean log2FoldChange lfcSE stat pvalue padj
## 1: A4GNT 0 NA NA NA NA NA
## 2: AA06 0 NA NA NA NA NA
## 3: AACSP1 0 NA NA NA NA NA
## 4: AADAC 0 NA NA NA NA NA
## 5: AADACL3 0 NA NA NA NA NA
## 6: AADACL4 0 NA NA NA NA NA
# How many such genes?
table(cd3[is.na(cd3$log2FoldChange), ]$baseMean == 0)##
## TRUE
## 3287
We will remove the genes with mean base expression level 0, as those were not expressed in our data.
# Data cleaning, remove unexpressed genes (baseMean = 0)
cd3 <- cd3[!cd3$baseMean == 0, ]
# convert HGNC gene symbols to more stable ENTREZ IDs
entrez <- bitr(cd3$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
# Convert ENTREZ to text, so that R would process it correctly
entrez$ENTREZID <- as.character(entrez$ENTREZID)As you may see from the R message, ~10% of HGNC symbols were not connected with any ENTREZ ID. This is due to the unstable nature of HGNC annotation. In real-life science project you should opt for more stable annotation scheme (e.g. use one ENSEMBL/UCSC annotation database version throughout project).
For the sake of brevity and simplicity, in this practical we remove unlinked gene IDs from following steps and proceed with remaining 90% of genes.
# Add column with ENTREZ IDs to the table, remove rows where ENTREZ was missing
cd3 <- merge(cd3, entrez, by.x = 'V1', by.y = 'SYMBOL')We will prepare the data for ORA. In ORA we will use genes which are significantly differentically expressed (FDR<0.05, absolute fold-change>2). Fold change is often used as additional filter for identifying most relevant genes for interpretation.
# Prepare the data for overrepresentation tests
# Filter in significant (adjusted P<0.05) results
# Filter in only genes with larger than than 2-fold expression change
cd3_sig <- cd3[cd3$padj < 0.05 & abs(cd3$log2FoldChange) > 1, ]Next we will prepare data for GSEA analysis. For that we need to order all genes based on their correlation strenght with the phenotype or effect size. In our case we will order those based on \(log_2(FC)\). We will also extract the vector of all tested genes which can be used as “gene universe”.
# Order gene table by effect size (log2 fold-change), from largest to smallest
cd3 <- cd3[order(cd3$log2FoldChange, decreasing = T), ]
# Look at the range of fold changes
#png('fold_changes.png', units = 'in', height = 7, width = 7)
plot(cd3$log2FoldChange, xlab = 'Gene', ylab = 'log2(FC)')#dev.off()
# Prepare data for GSEA (vector of log2(FC)'s, named by ENTREZ IDs)
cd3_gsea <- cd3$log2FoldChange
names(cd3_gsea) <- cd3$ENTREZID
# This vector of gene names corresponds to all genes we tested in the analysis and will be used later as "gene universe"In this part we run enrichment test (hypergeometric test) for all differentially expressed genes (FDR<0.05, FC>2).
We use all genes tested in the RNA-profiling study as a background set or “gene universe”. This list was already constructed in the previous step.
We will query for all enrichment results and write out 25 most differentially expressed, not accounting the significance.
The default multiple testing is done by Benjamini-Hochberg method, flag pvalueCutoff can be used for filtering only significant results (<0.05). Additionally, qvalueCutoff flag is used to filter results based on another popular method of FDR estimation (Storey q-value).
The defaults of the command also apply restrictions on the sizes of gene sets tested in the analysis, those can be seen with command ?enrichKEGG.
KEGG_all <- enrichKEGG(gene = cd3_sig$ENTREZID,
organism = 'hsa',
universe = cd3$ENTREZID,
pvalueCutoff = 1,
qvalueCutoff = 1)
# this command converts ENTREZ IDs back to HGNC symbols to ease interpretation
KEGG_all <- setReadable(KEGG_all, OrgDb = 'org.Hs.eg.db', keytype="ENTREZID")We will accustom ourselves with the data structure of the analysis output.
# show the data structure
# str(KEGG_all)The resulting object is more complex R S4 data structure which consists separate “containers” for different results, analysis settings, etc. You can access to those containers by using @.
# look at some slots
head(KEGG_all@result, 5) # result table## ID Description GeneRatio
## hsa04060 hsa04060 Cytokine-cytokine receptor interaction 56/423
## hsa04640 hsa04640 Hematopoietic cell lineage 24/423
## hsa05321 hsa05321 Inflammatory bowel disease (IBD) 18/423
## hsa05310 hsa05310 Asthma 11/423
## hsa04110 hsa04110 Cell cycle 24/423
## BgRatio pvalue p.adjust qvalue
## hsa04060 253/6282 3.309626e-16 9.300048e-14 8.326321e-14
## hsa04640 93/6282 5.627749e-09 7.906987e-07 7.079116e-07
## hsa05321 63/6282 8.685665e-08 8.135573e-06 7.283768e-06
## hsa05310 27/6282 5.523284e-07 3.880107e-05 3.473855e-05
## hsa04110 124/6282 1.889612e-06 1.061962e-04 9.507733e-05
## geneID
## hsa04060 BMPR2/CCL1/CCL24/CCL3/CCL4/CCR1/CCR9/CD27/CD70/CSF1R/CSF2/CX3CR1/CXCL10/CXCL13/EDAR/IFNG/IL10/IL12RB2/IL13/IL13RA1/IL17A/IL17F/IL18RAP/IL19/IL1R2/IL21/IL22/IL23R/IL26/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9/IL9R/KDR/LEPR/LIF/LTA/OSM/PDGFA/PDGFB/PDGFC/TGFB2/TGFBR2/TNFRSF10D/TNFRSF11B/TNFRSF13B/TNFRSF18/TNFRSF4/TNFRSF8/TNFSF9/XCL1/XCL2
## hsa04640 CD33/CD36/CD38/CD9/CR1/CR2/CSF1R/CSF2/FCER2/HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL1R2/IL2RA/IL3/IL4/IL5/IL7/IL7R/IL9R/ITGA5/ITGAM/TFRC
## hsa05321 HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IFNG/IL10/IL12RB2/IL13/IL17A/IL17F/IL18RAP/IL21/IL22/IL23R/IL4/IL5/MAF/TGFB2
## hsa05310 HLA-DMB/HLA-DOA/HLA-DQA1/HLA-DQB1/IL10/IL13/IL3/IL4/IL5/IL9/PRG2
## hsa04110 BUB1/BUB1B/CCNB1/CCNB2/CCND2/CCNE1/CDC20/CDC25A/CDC25C/CDC45/CDC6/CDKN2A/CDKN2B/CHEK1/E2F1/ESPL1/MAD2L1/MCM2/MCM4/ORC1/ORC6/TGFB2/TTK/WEE1
## Count
## hsa04060 56
## hsa04640 24
## hsa05321 18
## hsa05310 11
## hsa04110 24
KEGG_all@ontology # what ontology was tested## [1] "KEGG"
KEGG_all@pvalueCutoff # what P-value cutoff was used## [1] 1
Now we look at the results by printing out 25 first rows of the result table. View function in RStudio makes it very comfortable.
# remove 8. row which is long and uninformative
kable(head(KEGG_all@result[, -8], 25))| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | |
|---|---|---|---|---|---|---|---|---|
| hsa04060 | hsa04060 | Cytokine-cytokine receptor interaction | 56/423 | 253/6282 | 0.0000000 | 0.0000000 | 0.0000000 | 56 |
| hsa04640 | hsa04640 | Hematopoietic cell lineage | 24/423 | 93/6282 | 0.0000000 | 0.0000008 | 0.0000007 | 24 |
| hsa05321 | hsa05321 | Inflammatory bowel disease (IBD) | 18/423 | 63/6282 | 0.0000001 | 0.0000081 | 0.0000073 | 18 |
| hsa05310 | hsa05310 | Asthma | 11/423 | 27/6282 | 0.0000006 | 0.0000388 | 0.0000347 | 11 |
| hsa04110 | hsa04110 | Cell cycle | 24/423 | 124/6282 | 0.0000019 | 0.0001062 | 0.0000951 | 24 |
| hsa04630 | hsa04630 | Jak-STAT signaling pathway | 26/423 | 147/6282 | 0.0000041 | 0.0001940 | 0.0001737 | 26 |
| hsa05330 | hsa05330 | Allograft rejection | 11/423 | 35/6282 | 0.0000108 | 0.0004332 | 0.0003878 | 11 |
| hsa05140 | hsa05140 | Leishmaniasis | 15/423 | 71/6282 | 0.0000573 | 0.0020110 | 0.0018005 | 15 |
| hsa04658 | hsa04658 | Th1 and Th2 cell differentiation | 17/423 | 90/6282 | 0.0000837 | 0.0026146 | 0.0023408 | 17 |
| hsa04940 | hsa04940 | Type I diabetes mellitus | 10/423 | 39/6282 | 0.0001861 | 0.0052288 | 0.0046814 | 10 |
| hsa03460 | hsa03460 | Fanconi anemia pathway | 11/423 | 47/6282 | 0.0002155 | 0.0055063 | 0.0049298 | 11 |
| hsa04672 | hsa04672 | Intestinal immune network for IgA production | 10/423 | 45/6282 | 0.0006484 | 0.0151835 | 0.0135937 | 10 |
| hsa05320 | hsa05320 | Autoimmune thyroid disease | 10/423 | 48/6282 | 0.0011084 | 0.0239576 | 0.0214492 | 10 |
| hsa04659 | hsa04659 | Th17 cell differentiation | 16/423 | 105/6282 | 0.0016131 | 0.0316147 | 0.0283046 | 16 |
| hsa04610 | hsa04610 | Complement and coagulation cascades | 12/423 | 68/6282 | 0.0017079 | 0.0316147 | 0.0283046 | 12 |
| hsa04380 | hsa04380 | Osteoclast differentiation | 18/423 | 126/6282 | 0.0018001 | 0.0316147 | 0.0283046 | 18 |
| hsa05332 | hsa05332 | Graft-versus-host disease | 8/423 | 37/6282 | 0.0026938 | 0.0445276 | 0.0398655 | 8 |
| hsa05150 | hsa05150 | Staphylococcus aureus infection | 9/423 | 46/6282 | 0.0030767 | 0.0480301 | 0.0430013 | 9 |
| hsa04145 | hsa04145 | Phagosome | 19/423 | 144/6282 | 0.0034398 | 0.0508721 | 0.0455458 | 19 |
| hsa04151 | hsa04151 | PI3K-Akt signaling pathway | 35/423 | 326/6282 | 0.0036621 | 0.0514532 | 0.0460660 | 35 |
| hsa01521 | hsa01521 | EGFR tyrosine kinase inhibitor resistance | 12/423 | 79/6282 | 0.0061392 | 0.0821487 | 0.0735477 | 12 |
| hsa04350 | hsa04350 | TGF-beta signaling pathway | 12/423 | 82/6282 | 0.0082770 | 0.1057194 | 0.0946505 | 12 |
| hsa05166 | hsa05166 | HTLV-I infection | 27/423 | 249/6282 | 0.0089486 | 0.1093283 | 0.0978815 | 27 |
| hsa00380 | hsa00380 | Tryptophan metabolism | 7/423 | 37/6282 | 0.0105330 | 0.1233244 | 0.1104122 | 7 |
| hsa04657 | hsa04657 | IL-17 signaling pathway | 12/423 | 86/6282 | 0.0119855 | 0.1347165 | 0.1206115 | 12 |
Q1: How many KEGG pathways are significant if we consider Benjamini-Hochberg P<0.05?
A: 18
Q2: How many KEGG pathways are significant if we consider Storey Q<0.05?
A: 20
Q3: Authors of the original paper (Quinn et al. 2015, link) have already applied the KEGG overrepresentation test to the same data. Find the results from the publication and compare to your results- are those similar? If not exactly, speculate, what could be the reason(s) of observed differences?
A: Quite similar. Possible sources of difference:
-Possibly different ORA method.
-Possibly different database version.
-Possibly some difference in preprocessing of the data (e.g. normalization).
-Most probably different defaults of ORA (e.g. gene set size 10-500).
Q4: Use Fisher’s exact test to test the over-representation of the fifth most significant gene set enrichment result.
Google “Fisher’s Exact test R”
Construct the contigency table based on GeneRatio and BgRatio columns in result table- you can use this code snippet to replace “NAs” with the correct numbers from the result table:
GeneMatrixFisher <-
matrix(c(NA, NA, NA, NA),
nrow = 2,
dimnames = list(c("Diff_expressed", "Not_diff_expressed"),
c("Part_of_pathway", "Not_part_of_pathway")))Tip: you can use command addmargins on your GeneMatrixFisher to see if your contigency table adds up.
Use the GeneMatrixFisher to run Fisher’s Exact Test.
Q4.1: Report the used command, Fisher’s exact test P-value and odds ratio.
A:
Pathway:
Cell cycle
Contingency table:
GeneMatrixFisher <-
matrix(c(24, 124 - 24, 423 - 24, 6280 - 423 - (124 - 24)),
nrow = 2,
dimnames = list(c("Diff_expressed", "Not_diff_expressed"),
c("Part_of_pathway", "Not_part_of_pathway")))
GeneMatrixFisher## Part_of_pathway Not_part_of_pathway
## Diff_expressed 24 399
## Not_diff_expressed 100 5757
Command:
fisher.test(GeneMatrixFisher, alternative = 'greater')
Output:
fisher.test(GeneMatrixFisher, alternative = 'greater')##
## Fisher's Exact Test for Count Data
##
## data: GeneMatrixFisher
## p-value = 1.9e-06
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 2.270164 Inf
## sample estimates:
## odds ratio
## 3.461707
Q4.2: Does Fisher’s exact test give similar/same results as ClusterProfiler?
A: When considering rounding, yes. Hypergeometric test is actually identical with one-sided Fisher’s Exact Test.
# Fisher's Exact Test:
GeneMatrixFisher <-
matrix(c(32, (2050 - 32), (45-32), (6194 - 2050 - 13)),
nrow = 2,
dimnames = list(c("Part_of_pathway", "Not_part_of_pathway"),
c("Diff_exp", "Not_diff_exp")))
fisher.test(GeneMatrixFisher, alternative = 'greater')We can visualize our results on a barplot. Length of the bar shows significance (\(-log_{10}(P)\)) and the color shows number of genes overlapping with gene set.
input_barplot <- KEGG_all@result[, -8]
input_barplot$Description <- factor(input_barplot$Description, levels = rev(as.character(input_barplot$Description)))
# here we apply the default significance thresholds (Benjamini-Hochberg P<0.05 and Storey Q-value<0.2)
input_barplot <- input_barplot[input_barplot$p.adjust < 0.05 & input_barplot$qvalue < 0.2, ]
ggplot(input_barplot, aes(x = Description, y = -log10(pvalue), fill = Count)) + geom_bar(stat = 'identity') +
theme_classic() +
coord_flip() + scale_fill_continuous(low = 'lightblue', high = 'salmon')Next we visualize five most enriched pathways on the network graph. Size of the pathway node shows statistical significance and links indicate gene membership in the pathway. This gives the static graph as a output. If it is necessary to investigate further, you can use flag fixed = FALSE to see interactive version.
cnetplot(KEGG_all, categorySize = "pvalue", showCategory = 5, foldChange = cd3_gsea, cex = 0.1)# Interactive version
# cnetplot(KEGG_all, categorySize = "pvalue", showCategory = 5, foldChange = cd3_gsea, cex = 0.1, fixed = F)Q5: What are the most up- and down-regulated genes from 5 most enriched pathways, based on visual inspection?
A: Up-regulated: IFNG, down-regulated: KDR.
Q6: Is there any of the top pathways showing coordinated up- or down-regulation for the majority (or all) of its members?
A: Yes, “Cell cycle”, only one differentially expressed gene is down-regulated.
For KEGG pathways it is possible to visualize the location of differentially expressed genes on the pathway, as well as their magnitude of differential expression. This kind of visualization might help to identify most relevant genes for given condition and generate new research hypotheses.
pathview(gene.data = cd3_gsea,
pathway.id = KEGG_all@result[1, ]$ID,
species = "hsa",
limit = list(gene = max(abs(cd3_gsea)), cpd = 1),
out.suffix = "most_sig_pathway")For adding the plot to the HTML document: in the following command, replace the file name with the name of the file which was written to your work directory.

Q7: Same figure is also reported in the original publication (Quinn et al. 2015, link). Does it look similar or do you see any difference? Speculate, where can it come from?
A: Results are in quite good concordance with published results. There is some minor difference between two sets of results- this could be caused by the same reasons as explained in Q3.
Next we run the overrepresentation test for Gene Ontologies.
For the sake of brevity, we test only the ontologies from Biological Process category. By setting the ont flag, is also possible to test Molecular Function (MF), Cellular Component (CC) or all three categories combined (ALL).
For visualization purposes we apply this time significance thresholds (Benjamini-Hochberg FDR<0.05 and Storey FDR<0.05).
We specify that we test the enrichment for the gene sets with sizes between 10-10000 genes (flags minGSSize and maxGSSize)
GO_all <- enrichGO(gene = cd3_sig$ENTREZID,
universe = cd3$ENTREZID,
OrgDb = 'org.Hs.eg.db',
ont = 'BP',
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
minGSSize = 10,
maxGSSize = 10000)
# remove 8. row which is long and uninformative
kable(head(GO_all@result[, -8], 15))| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | |
|---|---|---|---|---|---|---|---|---|
| GO:0008283 | GO:0008283 | cell proliferation | 192/836 | 1756/14300 | 0 | 0 | 0 | 192 |
| GO:0002376 | GO:0002376 | immune system process | 242/836 | 2431/14300 | 0 | 0 | 0 | 242 |
| GO:0006950 | GO:0006950 | response to stress | 292/836 | 3271/14300 | 0 | 0 | 0 | 292 |
| GO:0006955 | GO:0006955 | immune response | 177/836 | 1681/14300 | 0 | 0 | 0 | 177 |
| GO:0007166 | GO:0007166 | cell surface receptor signaling pathway | 230/836 | 2405/14300 | 0 | 0 | 0 | 230 |
| GO:0006952 | GO:0006952 | defense response | 144/836 | 1305/14300 | 0 | 0 | 0 | 144 |
| GO:0042127 | GO:0042127 | regulation of cell proliferation | 148/836 | 1375/14300 | 0 | 0 | 0 | 148 |
| GO:0006260 | GO:0006260 | DNA replication | 51/836 | 270/14300 | 0 | 0 | 0 | 51 |
| GO:0050896 | GO:0050896 | response to stimulus | 522/836 | 7155/14300 | 0 | 0 | 0 | 522 |
| GO:0006954 | GO:0006954 | inflammatory response | 81/836 | 589/14300 | 0 | 0 | 0 | 81 |
| GO:0044763 | GO:0044763 | single-organism cellular process | 639/836 | 9331/14300 | 0 | 0 | 0 | 639 |
| GO:0051716 | GO:0051716 | cellular response to stimulus | 444/836 | 5880/14300 | 0 | 0 | 0 | 444 |
| GO:0051240 | GO:0051240 | positive regulation of multicellular organismal process | 138/836 | 1314/14300 | 0 | 0 | 0 | 138 |
| GO:1903047 | GO:1903047 | mitotic cell cycle process | 100/836 | 841/14300 | 0 | 0 | 0 | 100 |
| GO:0034097 | GO:0034097 | response to cytokine | 92/836 | 768/14300 | 0 | 0 | 0 | 92 |
To ease the interpretation and see the hierarchical relationships between most enriched GO terms, we visualize those on graph structure. Square boxes are significantly enriched terms and color depicts the significance (P-value). We will visualize 15 most significant terms.
#png('GO_plot.png', height = 10, width = 10, units = 'in', res = 400)
plotGOgraph(GO_all, firstSigNodes = 15)## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 53
## Number of Edges = 86
##
## $complete.dag
## [1] "A graph with 53 nodes."
#dev.off()Q8: What is the most significant GO term?
A: Cell proliferation
Q9: What are the main general themes coming out of the analysis?
A: Look at the hierarchical structure indicates mainly immune-related processes and cell proliferation.
Clusterprofiler allows to do also ORA for human diseases, making use of several databases of curated gene-disease relationships. As our trait of interest is Coeliac disease, it would be interesting to see differentially expressed genes also show enrichment of some human diseases and whether those are could be related with Coeliac disease or any other autoimmune disease.
Q10: If you see that the genes differentially expressed in your disease-of-interest (in our case Coeliac disease) are enriched by genes known to be associated with some other disease (e.g. inflammatory bowel disease, another autoimmune disease), what is the biological conclusion you make? Could this observation be useful for the search of the treatment and how would you proceed with that?
A: This could mean that there might be some similar mechanisms playing role in the pathogenesis of those two diseases. You can imagine that if there is known treatment for disease which comes significant, it might point you to leads for new potential drugs/treatments.
We will run DO enrichment analysis by asking all the results (no filtering based on FDR) and otherwise with default settings (minGSSize = 10, maxGSSize = 500).
DO_all <- enrichDO(gene = cd3_sig$ENTREZID,
universe = cd3$ENTREZID,
pvalueCutoff = 1,
qvalueCutoff = 1)
DO_all <- setReadable(DO_all, OrgDb = 'org.Hs.eg.db', keytype = "ENTREZID")
# remove 8. column which is long and uninformative
kable(head(DO_all@result[, -8], 15))| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | |
|---|---|---|---|---|---|---|---|---|
| DOID:850 | DOID:850 | lung disease | 77/524 | 458/6862 | 0e+00 | 0.00e+00 | 0.00e+00 | 77 |
| DOID:0050161 | DOID:0050161 | lower respiratory tract disease | 78/524 | 467/6862 | 0e+00 | 0.00e+00 | 0.00e+00 | 78 |
| DOID:74 | DOID:74 | hematopoietic system disease | 71/524 | 434/6862 | 0e+00 | 1.00e-07 | 0.00e+00 | 71 |
| DOID:1176 | DOID:1176 | bronchial disease | 33/524 | 137/6862 | 0e+00 | 3.00e-07 | 2.00e-07 | 33 |
| DOID:2841 | DOID:2841 | asthma | 30/524 | 119/6862 | 0e+00 | 4.00e-07 | 3.00e-07 | 30 |
| DOID:2320 | DOID:2320 | obstructive lung disease | 50/524 | 282/6862 | 0e+00 | 1.20e-06 | 8.00e-07 | 50 |
| DOID:2723 | DOID:2723 | dermatitis | 34/524 | 166/6862 | 1e-07 | 7.70e-06 | 5.00e-06 | 34 |
| DOID:37 | DOID:37 | skin disease | 51/524 | 317/6862 | 2e-07 | 1.73e-05 | 1.12e-05 | 51 |
| DOID:3905 | DOID:3905 | lung carcinoma | 67/524 | 470/6862 | 2e-07 | 1.85e-05 | 1.20e-05 | 67 |
| DOID:16 | DOID:16 | integumentary system disease | 55/524 | 357/6862 | 3e-07 | 1.85e-05 | 1.20e-05 | 55 |
| DOID:3388 | DOID:3388 | periodontal disease | 27/524 | 121/6862 | 3e-07 | 1.85e-05 | 1.20e-05 | 27 |
| DOID:3310 | DOID:3310 | atopic dermatitis | 28/524 | 130/6862 | 4e-07 | 2.23e-05 | 1.44e-05 | 28 |
| DOID:824 | DOID:824 | periodontitis | 24/524 | 104/6862 | 6e-07 | 3.79e-05 | 2.45e-05 | 24 |
| DOID:104 | DOID:104 | bacterial infectious disease | 41/524 | 243/6862 | 9e-07 | 4.75e-05 | 3.08e-05 | 41 |
| DOID:5082 | DOID:5082 | liver cirrhosis | 36/524 | 201/6862 | 1e-06 | 4.92e-05 | 3.18e-05 | 36 |
Q11: Does resulting significant disease list make sense in relation to Coeliac disease? Are there any diseases which could show similar expression profile as Coeliac disease? If not, speculate, what could be the reason of ambigous results?
A: Not really. It is not easy to interpret e.g. lung diseases in relation with Coeliac disease. Maybe “asthma” and “atopic dermatitis” might be in line as those are also immune-related diseases. However, you may see that most significant results are indicating quite general diseases (e.g. “lung disease”, “skin disease”, etc.) with >100 genes linked to the disease. This is due to hierchical nature of disease ontology, where there are also very broad “diseases” combining many potentially looseli related pathologies. This gives also quite broad enrichment results, as there are actually few “narrow-defined” diseases with so large number of known disease-related genes. It might be more informative to enforce upper limit for the size of tested gene sets- you can experiment with maxGSSize. Hint: Setting maxGSSize e.g. to 100, gives much more easily interpretable results, with immune-related diseases in top. However, you should always reason the choice of analysis settings based on analytical consideration (ideally before running the analysis), not based on which setting gives you the most appealing results!
Next we will use GSEA (Subramanian et al. 2005) on the diffrential analysis results. Unlike ORA, GSEA uses all the results from differential expression analysis, not only significant ones.
For the sake of speed and brevity, we use recommended minimal number of 1000 permutations for this analysis (to get more stable and precise results your could increase the number of permutations to e.g. 10000 in real scientific work). Also, we test only KEGG pathways which have more than 50 known gene members.
KEGG_GSEA <- gseKEGG(geneList = cd3_gsea,
organism = 'hsa',
nPerm = 1000,
minGSSize = 50,
pvalueCutoff = 1,
verbose = FALSE, seed = T)
kable(head(KEGG_GSEA@result[, -c(10:12)], 15))| ID | Description | setSize | enrichmentScore | NES | pvalue | p.adjust | qvalues | rank | |
|---|---|---|---|---|---|---|---|---|---|
| hsa04060 | hsa04060 | Cytokine-cytokine receptor interaction | 253 | 0.4542574 | 1.780635 | 0.0012422 | 0.0313647 | 0.0244104 | 1526 |
| hsa04630 | hsa04630 | Jak-STAT signaling pathway | 147 | 0.5254741 | 1.947766 | 0.0013643 | 0.0313647 | 0.0244104 | 1429 |
| hsa04114 | hsa04114 | Oocyte meiosis | 110 | 0.4603085 | 1.637844 | 0.0013947 | 0.0313647 | 0.0244104 | 3610 |
| hsa00240 | hsa00240 | Pyrimidine metabolism | 95 | 0.5396680 | 1.895844 | 0.0014164 | 0.0313647 | 0.0244104 | 2716 |
| hsa04110 | hsa04110 | Cell cycle | 124 | 0.5998859 | 2.158882 | 0.0014164 | 0.0313647 | 0.0244104 | 2786 |
| hsa01200 | hsa01200 | Carbon metabolism | 108 | 0.5164693 | 1.828463 | 0.0014184 | 0.0313647 | 0.0244104 | 3952 |
| hsa04657 | hsa04657 | IL-17 signaling pathway | 86 | 0.5680668 | 1.962633 | 0.0014409 | 0.0313647 | 0.0244104 | 2661 |
| hsa05132 | hsa05132 | Salmonella infection | 80 | 0.5272237 | 1.799571 | 0.0014535 | 0.0313647 | 0.0244104 | 2189 |
| hsa01230 | hsa01230 | Biosynthesis of amino acids | 66 | 0.5227021 | 1.724760 | 0.0015015 | 0.0313647 | 0.0244104 | 3968 |
| hsa05162 | hsa05162 | Measles | 129 | 0.4302471 | 1.559420 | 0.0028011 | 0.0418149 | 0.0325435 | 2151 |
| hsa05217 | hsa05217 | Basal cell carcinoma | 61 | -0.4873583 | -1.749591 | 0.0028902 | 0.0418149 | 0.0325435 | 2714 |
| hsa05321 | hsa05321 | Inflammatory bowel disease (IBD) | 63 | 0.5091961 | 1.664157 | 0.0030441 | 0.0418149 | 0.0325435 | 866 |
| hsa04380 | hsa04380 | Osteoclast differentiation | 126 | -0.4338584 | -1.761521 | 0.0034364 | 0.0418149 | 0.0325435 | 1088 |
| hsa03010 | hsa03010 | Ribosome | 130 | -0.4784401 | -1.961815 | 0.0034483 | 0.0418149 | 0.0325435 | 3535 |
| hsa04514 | hsa04514 | Cell adhesion molecules (CAMs) | 131 | -0.3546720 | -1.452435 | 0.0035211 | 0.0418149 | 0.0325435 | 3501 |
Q10: How many pathways are significant if we consider Benjamini-Hochberg FDR?
A: 11
Q11: Are the results in line with ORA results?
A: Generally yes.
print(head(KEGG_GSEA@result, 1)$Description)## [1] "Cytokine-cytokine receptor interaction"
gseaplot(KEGG_GSEA, geneSetID = head(KEGG_GSEA@result, 1)$ID)It is possible to easily compare the results several ORA analyses, using the functionality from compareCluster results. Next we read in and preprocess differentially expressed genes from all three conditions (UN-CeD vs control, unstimulated; CD3-CeD vs control, anti-CD3 stimulation; PMA-CeD vs control, PMA stimulation). We will run the KEGG ORA, using default settings and all genes tested in the study as an background.
# Read in and preprocess all different conditions:
# Unstimulated
unstim <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_unstim_all.csv')
unstim <- unstim[!unstim$baseMean == 0, ]
entrez <- bitr(unstim$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
unstim <- merge(unstim, entrez, by.x = 'V1', by.y = 'SYMBOL')
unstim_overrep <- unstim[unstim$padj < 0.05 & abs(unstim$log2FoldChange) > 1, ]
# CD3
cd3 <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_cd3_all.csv')
cd3 <- cd3[!cd3$baseMean == 0, ]
entrez <- bitr(cd3$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
cd3 <- merge(cd3, entrez, by.x = 'V1', by.y = 'SYMBOL')
cd3_overrep <- cd3[cd3$padj < 0.05 & abs(cd3$log2FoldChange) > 1, ]
# PMA
pma <- fread('/Users/urmovosa/Documents/Teaching/iBMS/practical/RNA_seq_part/DEG_lists/diffExpGenes_pma_all.csv')
pma <- pma[!pma$baseMean == 0, ]
entrez <- bitr(pma$V1, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
pma <- merge(pma, entrez, by.x = 'V1', by.y = 'SYMBOL')
pma_overrep <- pma[pma$padj < 0.05 & abs(pma$log2FoldChange) > 1, ]
input_comparison <- list(UNST = unstim_overrep$ENTREZID,
CD3 = cd3_overrep$ENTREZID,
PMA = pma_overrep$ENTREZID)
ck <- compareCluster(geneCluster = input_comparison, fun = "enrichKEGG",
universe = unstim$ENTREZID,
pvalueCutoff = 1,
qvalueCutoff = 1)We use dplyr and ggplot2 functionality to compare the 20 most significant pathways for each group. Here, the size of the dot indicates significance (\(-log_{10}(P)\)) and red dots indicate pathways withstanding multiple testing correction.
library(dplyr) # we load this package only now because of apparent conflict with "pathview" package
# Next command does the following:
# groups data based on stimulation (Cluster)
# from each stimulation takes out 20 genes with smallest P-values
# arranges all genes by P-value for visualization
# reorders factor levels of "Description", so that ggplot2 could use the correct order
# constructs variable "significance" to outline sets which are Benjamini-Hochberg FDR<0.05
# uses ggplot to visualize the plot
compare_input <- ck@compareClusterResult
compare_input %>%
group_by(Cluster) %>%
top_n(20, desc(pvalue)) %>%
arrange(pvalue) %>%
mutate(indic = row_number()) %>%
ungroup() %>%
mutate(Description = factor(Description, levels = rev(unique(as.character(Description))))) %>%
mutate(significance = case_when(p.adjust < 0.05 ~ 'sig.',
p.adjust >= 0.05 ~ 'not sig.')) %>%
ggplot(aes(y = Description, x = Cluster, colour = significance, size = -log10(pvalue))) +
geom_point(stat = 'identity') +
theme_bw() +
scale_colour_manual(values = c('black', 'red'))Q11: Are there any KEGG pathways overlapping between any of the conditions?
A: Yes, “Cytokine-cytokine receptor interaction”, “Inflammatory bowel disease”, between two stimulations.
Q12: What condition has the largest number of significantly enriched pathways?
A: Anti-CD3 stimulation.
Q13: Are the per-cohort results in line with the ORA results reported in the paper?
A: Generally yes, but there are also some differences.
Finally, when wrapping up your analysis, it is always a good idea to save the settings you have used. This involves software versions and the date of the analysis, making it possible to replicate or debug the results by different person.
Command sink is useful for saving any output from the screen into file.
Investigate the resulting file and note what information is saved.
sink("AnalysisInfo.txt") # Open the connection to file.
sessionInfo() # Prints information about R session.
Sys.time() # Prints the information about analysis time stamp.
sink() # Close the connection to file.Next advanced tasks are meant to quickest students who have already finalized previous analyses.
ClusterProfiler has a very useful functionality to allow performing ORA and GSEA for any gene set database. This means that you can define the gene set yourself or use some custom gene set database which is not explicitly implemented to ClusterProfiler.
We will try to use this functionality on anti-CD3 treated Coeliac disease differential expression results.
Task 1 Investigate ClusterProfiler manual here to identify which commands allow you to perform universal ORA and GSEA. Note what extra file(s) are needed to do so.
Task 2 There are numerous gene sets available and downloadable on the web tool enrichr website (http://amp.pharm.mssm.edu/Enrichr/). Locate the downloadable databases from the web site, investigate what kind of gene sets are available and select one you would like to test in your data. Also, think what gene set dataset gives interesting information about differentially expressed genes in Coeliac disase. Download the desirable file.
NB! Some of the data files might not work- then just make another choice for now. Also, some of the datasets include numerical values after each gene name- skip those for now.
Task 3 Unfortunately, the data files are not in the “R-friendly” format. The biggest challenge in this bonus task is to read the file in and convert it to the usable data.frame. It should finally look like that:
| term | gene |
|---|---|
| term1 | entrezID1 |
| term2 | entrezID2 |
Note that column names should be “term” and “gene”, and gene name should be ENTREZ ID.
Google and investigate read.table documentation, how to read tables with variable numbers of elements in each row into R.
fill, na.strings, col.names are needed. Also the usage of command count.fields is mandatory.If you manage to read the table into R, apply melt command from reshape2 package to convert data to long format, as you have previously learned in the R part. Install reshape2, if necessary.
Remove unnecessary column(s) and rows where gene name is “NA”. Make use of !is.na() for that.
Convert HGNC symbols to ENTREZ IDs, as we have done before. Do not forget to convert ENTREZ ID to character.
Merge the ENTREZ IDs to the intial table, manipulate the table to the format we need. Use the tricks you have learned today and on previous days for that.
Task 4 Run universal ORA, using differentially expressed genes from CD3 stimulation and your own gene set database. Put P-value and Q-value cutoffs to 1, to investigate top results even if none are significant.
Q: Did you find any significant and/or interpretable results?
Task 5 Run universal GSEA, using all ranked genes from CD3 stimulation and your own gene set database. Put P-value cutoff to 1, to investigate top results even if none are significant.
Q: Did you find any significant and/or interpretable results?
Task 6 Visualize your ORA and GSEA results using plots we have constructed today.
Convert the data into usable format.
# read in data
library(reshape2)
library(stringr)
database <- read.table('/Users/urmovosa/Downloads/TargetScan_microRNA.txt',
fill = T,
sep = '\t',
na.strings = '',
header = F,
col.names = paste0("V", 1:max(count.fields('/Users/urmovosa/Downloads/TargetScan_microRNA.txt', sep = '\t'))))
database <- database[, -2]
database <- melt(database, id.vars = 'V1')
database <- database[, -2]
colnames(database) <- c('term', 'gene')
database$gene <- str_replace(database$gene, ',.*', '')
entrez <- bitr(database$gene, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")
database <- merge(database, entrez, by.x = 'gene', by.y = 'SYMBOL')
database <- database[, c(2:3)]
colnames(database)[2] <- 'gene'
database$gene <- as.character(database$gene)Run ORA
# enrichment analysis
enricher_output <- enricher(cd3_sig$ENTREZID, pvalueCutoff = 1, qvalueCutoff = 1, pAdjustMethod = "BH", universe = cd3$ENTREZID,
minGSSize = 10, maxGSSize = 500, TERM2GENE = database,
TERM2NAME = NA)Run GSEA
gsea_output <- GSEA(cd3_gsea, pvalueCutoff = 1, pAdjustMethod = "BH",
minGSSize = 10, maxGSSize = 500, TERM2GENE = database,
TERM2NAME = NA)Quinn, Emma M., Ciara Coleman, Ben Molloy, Patricia Dominguez Castro, Paul Cormican, Valerie Trimble, Nasir Mahmud, and Ross McManus. 2015. “Transcriptome Analysis of Cd4+ T Cells in Coeliac Disease Reveals Imprint of Bach2 and IFN\(\gamma\) Regulation.” PLOS ONE 10 (10). Public Library of Science: e0140049. doi:10.1371/journal.pone.0140049.
Subramanian, Aravind, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50. doi:10.1073/pnas.0506580102.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “clusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS 16 (5). 140 Huguenot Street, 3rd FloorNew Rochelle, NY 10801USA: Mary Ann Liebert, Inc.: 284–87. doi:10.1089/omi.2011.0118[PII].